On the incremental addition of regression classes for speaker adaptation
Authors
Abstract
In recent work, we proposed the all-pass transform (APT) as the basis of a speaker adaptation scheme intended for use with a large vocabulary speech recognition system. It was shown that APT-based adaptation reduces to a linear transformation of cepstral means, much like the better-known maximum likelihood linear regression (MLLR). Due to this linearity, APT-based adaptation can be used in conjunction with speaker-adapted training (SAT), an algorithm for performing maximum likelihood estimation of the parameters of a hidden Markov model (HMM) when speaker adaptation is to be employed during both training and test. In other work, we proposed a refinement of SAT dubbed single-pass adapted training (SPAT), tailored specifically for use with the APT. Here we introduce an incremental training procedure intended for use with the APT and multiple regression classes. In a set of speech recognition experiments conducted on the Switchboard Corpus, we obtained a word error rate of 37.9% using APT adaptation, a significant improvement over the 39.5% word error rate achieved with MLLR.

1. INTRODUCTION

Speaker-adapted training (SAT) is an algorithm for performing maximum likelihood estimation of the parameters of an HMM when speaker adaptation is to be employed during both training and test [1]. SAT can be used with any speaker adaptation scheme employing a linear transformation of cepstral means, including both maximum likelihood linear regression [4] and the all-pass transform (APT) based formulation discussed in [5]. In a typical implementation of speaker adaptation, the Gaussian components of an HMM are partitioned into a number of mutually exclusive sets, or regression classes; several straightforward modifications of the basic SAT algorithm have been proposed to update the assignment of Gaussian components to classes using a maximum likelihood (ML) criterion.
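As an illustration of what "a linear transformation of cepstral means" with regression classes amounts to, the following minimal sketch applies one transform per class to a set of speaker-independent means. The function names, the toy numbers, and the use of a generic matrix plus bias vector are assumptions made for illustration; in APT-based adaptation the matrix would be determined by a small number of APT parameters, which this sketch does not model.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(A: Matrix, b: Vector, mu: Vector) -> Vector:
    """Return A @ mu + b using plain Python lists."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def adapt_means(means: List[Vector],
                class_of: List[int],
                transforms: Dict[int, Tuple[Matrix, Vector]]) -> List[Vector]:
    """Adapt each speaker-independent mean with its class's transform."""
    return [affine(*transforms[class_of[k]], mu) for k, mu in enumerate(means)]


# Toy data (hypothetical): three Gaussians, two regression classes.
si_means = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
class_of = [0, 1, 0]  # Gaussian index -> regression class
transforms = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.0]),  # identity matrix plus a shift
    1: ([[2.0, 0.0], [0.0, 0.5]], [0.0, 0.0]),  # diagonal scaling, no shift
}

adapted = adapt_means(si_means, class_of, transforms)
print(adapted)
```

Because the classes are mutually exclusive, each mean is touched by exactly one transform, so adding a class only changes which transform a subset of the Gaussians uses.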
Single-pass adapted training (SPAT), which was introduced in [8], is a variation of SAT tailored specifically for use with APT-based adaptation. SPAT makes extensive use of an HMM with one Gaussian component per state cluster to estimate speaker-dependent APT parameters; these parameters are then transferred to the final multiple-mixture HMM in a computationally efficient manner.

0-7803-6293-4/00/$10.00 © 2000 IEEE.

The incremental training procedure developed here can be regarded as a combination and further refinement of the procedures mentioned above. It is based on the idea of gradually increasing the amount of detail used in modeling the characteristics of a novel speaker; in this regard, it is similar to the incremental build approach to HMM training favored by HTK, the Hidden Markov Model Toolkit [9]. Modeling detail can be added by increasing the number of regression classes, by increasing the number of parameters specifying each speaker-dependent APT, or both. As in SPAT, all APT parameters are estimated using a single-mixture model and then transferred to the final, multiple-mixture model. In this work we concentrate on two of the most important aspects of incremental training: the means by which a class is subdivided, or split, and the means by which a class is chosen for splitting.

2. TRAINING PROCEDURES

Here we describe the procedures used in incrementally training an HMM for use with speaker adaptation.

2.1. Speaker-Adapted Training (SAT)

As mentioned in the introduction, SAT is an algorithm for performing ML estimation of HMM parameters when speaker adaptation is to be used during both test and training [1]. Very often the Gaussians of an HMM are partitioned into regression classes and a distinct transformation matrix is estimated for each class. In this case it is possible to assign each Gaussian to a regression class based on the ML criterion [3] during SAT parameter re-estimation.
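The idea of assigning a Gaussian to a regression class by the ML criterion can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of [3] or [7]: the Gaussian's mean and diagonal variance are held fixed, each class has one affine transform per speaker, and the per-speaker statistics (an occupancy count and a posterior-weighted feature sum) are assumed to be already accumulated. All names and numbers are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Transform = Tuple[Matrix, Vector]        # (A, b) for one speaker
Stats = Dict[Hashable, Tuple[float, Vector]]  # speaker -> (count, weighted sum)


def affine(A: Matrix, b: Vector, mu: Vector) -> Vector:
    """Return A @ mu + b using plain Python lists."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i for row, b_i in zip(A, b)]


def class_score(mu: Vector, var: Vector, stats: Stats,
                class_transforms: Dict[Hashable, Transform]) -> float:
    """Class-dependent part of the expected log-likelihood of one Gaussian.

    For adapted mean m, count c and weighted feature sum o, the terms of
    -0.5 * sum_i c_i (x_i - m)^T D^{-1} (x_i - m) that depend on m are
    sum_d (o_d * m_d - 0.5 * c * m_d**2) / var_d.
    """
    score = 0.0
    for speaker, (count, wsum) in stats.items():
        m = affine(*class_transforms[speaker], mu)
        score += sum((o_d * m_d - 0.5 * count * m_d * m_d) / v_d
                     for o_d, m_d, v_d in zip(wsum, m, var))
    return score


def best_class(mu: Vector, var: Vector, stats: Stats,
               transforms: Dict[int, Dict[Hashable, Transform]]) -> int:
    """Pick the regression class with the highest score for this Gaussian."""
    return max(transforms, key=lambda r: class_score(mu, var, stats, transforms[r]))


# Toy data (hypothetical): one Gaussian, two speakers, two candidate classes.
identity = [[1.0, 0.0], [0.0, 1.0]]
mu, var = [1.0, 0.0], [1.0, 1.0]
stats = {"a": (2.0, [4.0, 0.0]), "b": (1.0, [2.0, 0.0])}
transforms = {
    0: {"a": (identity, [1.0, 0.0]), "b": (identity, [1.0, 0.0])},  # shifted
    1: {"a": (identity, [0.0, 0.0]), "b": (identity, [0.0, 0.0])},  # unshifted
}

chosen = best_class(mu, var, stats, transforms)
print(chosen)
```

In the toy data the observed features of both speakers average to [2.0, 0.0], which matches the mean adapted by class 0, so that class receives the higher score.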
The Optimal Regression Class (ORC) estimation procedure described in [7] and summarized here is a slight departure from that presented in [3] inasmuch as the mean and class assignment of a Gaussian are updated jointly rather than sequentially. Let $\Lambda_k = (\mu_k, D_k)$ denote the parameters of the $k$th Gaussian, where $\mu_k$ and $D_k = \mathrm{diag}\{\sigma_{k,1}^2, \sigma_{k,2}^2, \ldots\}$ are respectively the speaker-independent (SI) mean and diagonal covariance. Let $x_i^{(s)}$ denote the $i$th cepstral feature from speaker $s$, $c_{k,i}^{(s)}$ the posterior probability that $x_i^{(s)}$ was